Parallel Duplicate Detection in Adverse Drug Reaction Databases with Spark
Abstract
The World Health Organization (WHO) and drug regulators in many countries maintain databases of adverse drug reaction reports. Data duplication is a significant problem in such databases because reports often come from a variety of sources. Most duplicate detection techniques either cannot handle large amounts of data or lack effective means to deal with data that has an imbalanced label distribution. In this paper, we propose a scalable duplicate detection method built on top of Spark to address these problems. Our method uses the kNN (k-nearest neighbors) classifier to identify the labelled report pairs that are most useful for classifying new report pairs. To deal with the high computational cost of kNN, we partition the labelled data into clusters for parallel computing. We give a method to minimize the cross-cluster kNN search. Our experimental results show that the proposed method produces robust duplicate detection results and scalable performance.
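The idea of partitioning labelled data into clusters and limiting cross-cluster kNN search can be illustrated with a small single-process sketch. This is not the authors' algorithm and it does not use Spark. It assumes each report pair is already encoded as a numeric feature vector with a label (1 = duplicate, 0 = not duplicate), that cluster centers are given, and it uses a standard triangle-inequality bound to skip clusters that cannot contain a closer neighbor. All names (`build_clusters`, `knn_search`, `classify`) are hypothetical.

```python
import heapq
import math
import random


def build_clusters(points, centers):
    """Assign each labelled (vector, label) pair to its nearest center.

    Each cluster also records its radius: the largest distance from the
    center to any of its members.
    """
    clusters = [{"center": c, "members": [], "radius": 0.0} for c in centers]
    for vec, label in points:
        idx = min(range(len(centers)), key=lambda i: math.dist(vec, centers[i]))
        cluster = clusters[idx]
        cluster["members"].append((vec, label))
        cluster["radius"] = max(cluster["radius"], math.dist(vec, centers[idx]))
    return clusters


def knn_search(query, clusters, k):
    """Return (sorted [(distance, label)], number of clusters scanned)."""
    # Visit clusters whose centers are closest to the query first.
    order = sorted(clusters, key=lambda cl: math.dist(query, cl["center"]))
    best = []  # max-heap of the k best so far, stored as (-distance, label)
    scanned = 0
    for cluster in order:
        # No member can be closer to the query than this lower bound.
        lower_bound = math.dist(query, cluster["center"]) - cluster["radius"]
        if len(best) == k and lower_bound >= -best[0][0]:
            continue  # this cluster cannot improve the current k-th neighbor
        scanned += 1
        for vec, label in cluster["members"]:
            d = math.dist(query, vec)
            if len(best) < k:
                heapq.heappush(best, (-d, label))
            elif d < -best[0][0]:
                heapq.heapreplace(best, (-d, label))
    return sorted((-neg_d, label) for neg_d, label in best), scanned


def classify(query, clusters, k=3):
    """Plain majority vote over the k nearest labelled pairs (assumption)."""
    neighbours, _ = knn_search(query, clusters, k)
    votes = sum(label for _, label in neighbours)
    return votes * 2 > len(neighbours)


if __name__ == "__main__":
    random.seed(0)
    # Synthetic 2-D feature vectors standing in for report-pair features.
    data = [((random.random(), random.random()), random.randint(0, 1))
            for _ in range(200)]
    centers = [(0.25, 0.25), (0.75, 0.25), (0.25, 0.75), (0.75, 0.75)]
    clusters = build_clusters(data, centers)
    neighbours, scanned = knn_search((0.1, 0.1), clusters, k=5)
    print(f"scanned {scanned} of {len(clusters)} clusters")
    print("duplicate?", classify((0.1, 0.1), clusters, k=5))
```

In a distributed setting each cluster could plausibly live in its own partition, so that the pruning test decides which partitions a query has to visit. The plain majority vote here does nothing about imbalanced labels, which the abstract names as a separate concern.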
Similar resources
Improving the measurement and detection of serious adverse drug reactions in databases of stored electronic health records
DduP - Towards a Deduplication Framework Utilising Apache Spark
This paper presents a new framework called DeduPlication (DduP). DduP aims to solve large-scale deduplication problems on arbitrary data tuples and tries to bridge the gap between big data, high performance and duplicate detection. A first prototype exists, but the overall project is still a work in progress. DduP utilises the promising successor of Apache Hadoop MapReduce [Had14]...
Amodiaquine-Associated Asthenia: A Case Based Review and Gaps in Literature
Introduction: Amodiaquine is a partner drug in the artemisinin-based combination therapy artesunate-amodiaquine. Reports of the adverse drug reaction known as amodiaquine-associated asthenia are scarce, and this adverse reaction needs to be investigated in detail. This article presents and reviews a case of amodiaquine-associated asthenia. A literature search for the characteri...
Statistical methods for knowledge discovery in adverse drug reaction surveillance
Collections of individual case safety reports are the main resource for early discovery of unknown adverse reactions to drugs once they have been introduced to the general public. The data sets involved are complex and based on voluntary submission of reports, but contain pieces of very important information. The aim of this thesis is to propose computationally feasible statistical methods for ...
Design and sampling method of a study of the knowledge, attitudes and practices of households and health workers regarding nutrition and micronutrients in the program's pilot provinces
Background and Objectives: To compare three different methods of signal detection applied to the adverse drug reactions registered in the Iranian Pharmacovigilance database from 1998 to 2005. Materials and Methods: All adverse drug reactions (ADRs) reported to the Iranian Pharmacovigilance Center from March 1998 through January 2005 were included in the analysis. The data were analyzed based on thr...